refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode by coketaste · Pull Request #113 · ROCm/madengine

coketaste · 2026-04-24T04:00:21Z

Summary

Auto-detect GPU arch before build in full workflow (madengine run --tags): the build phase previously ran with build_only_mode=True and skipped GPU detection entirely, so
Dockerfiles with ARG MAD_SYSTEM_GPU_ARCHITECTURE (no default) were built with an empty value. Users were forced to manually supply --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}' on every local run.

GPU arch auto-detection — design

Detection fires only in the RunOrchestrator._build_phase() path (full workflow). The flag is False by default so standalone madengine build is unaffected.

Detection chain reuses existing modules — no new code:

detect_gpu_vendor() (utils/gpu_validator.py) — fast, filesystem-only check, no subprocess
get_gpu_tool_manager(vendor, rocm_path) (utils/gpu_tool_factory.py) — singleton factory
manager.get_gpu_architecture() (utils/rocm_tool_manager.py / nvidia_tool_manager.py) — cached, with actionable error messages
normalize_architecture_name() (execution/dockerfile_utils.py) — consistent format

User-provided value in --additional-context is always respected and never overridden. Fails gracefully with a warning on nodes without a GPU.

Test plan

madengine run --tags <tag> on a local GPU node — build phase prints Auto-detected GPU architecture for build: gfx942, no "unresolved" warning, image builds correctly
madengine run --tags <tag> --additional-context '{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}' — user value respected, auto-detect skipped
madengine build --tags <tag> on a non-GPU CI node — same behaviour as before (warning if Dockerfile needs arch, no crash)
madengine run --manifest-file build_manifest.json — run-only path unchanged
Existing unit tests pass: pytest tests/unit/ tests/e2e/

… full-run mode In full workflow (madengine run --tags), the build phase ran with build_only_mode=True and skipped GPU detection, leaving Dockerfiles that declare ARG MAD_SYSTEM_GPU_ARCHITECTURE without a default built with an empty value. Users had to manually pass --additional-context every time. - Context.__init__: add detect_local_gpu_arch param (default False), thread it to init_build_context() - Context.init_build_context: add detect_gpu_arch param; when True, reuse detect_gpu_vendor() + get_gpu_tool_manager() + normalize_architecture_name() to detect and inject MAD_SYSTEM_GPU_ARCHITECTURE into docker_build_arg before the build; user-provided value is never overridden; fails gracefully - BuildOrchestrator.__init__: accept and forward detect_local_gpu_arch to Context; add resolved-arch confirmation print in execute() - RunOrchestrator._build_phase: pass detect_local_gpu_arch=True so only the full-workflow build path auto-detects; standalone madengine build is unaffected (flag defaults to False) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Copilot

Pull request overview

This PR enables auto-detection of MAD_SYSTEM_GPU_ARCHITECTURE during the build phase of the full madengine run --tags ... workflow so Dockerfiles that require this build-arg (without a default) can build successfully without forcing users to pass it via --additional-context.

Changes:

Enable GPU-arch auto-detection in RunOrchestrator._build_phase() by passing a new detect_local_gpu_arch=True flag into BuildOrchestrator.
Add detect_local_gpu_arch plumbing through BuildOrchestrator into Context and implement optional build-only GPU arch injection in Context.init_build_context().
Print a “resolved” message when MAD_SYSTEM_GPU_ARCHITECTURE is present in docker_build_arg before building.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`src/madengine/orchestration/run_orchestrator.py`	Turns on build-time GPU arch auto-detection for full build+run workflows by passing `detect_local_gpu_arch=True`.
`src/madengine/orchestration/build_orchestrator.py`	Adds `detect_local_gpu_arch` parameter, forwards it into `Context`, and prints when the arch is resolved.
`src/madengine/core/context.py`	Implements optional build-only detection of GPU arch and injects it into `ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"]`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ild_context Five new tests in TestBuildContextGpuArchAutoDetect cover the detect_gpu_arch path added in 78b5110: - arch is injected when MAD_SYSTEM_GPU_ARCHITECTURE is absent - user-provided value is preserved (no override) - UNKNOWN vendor emits a warning and leaves the key unset - detection exceptions are caught and warned, not raised - detect_gpu_arch=False skips vendor detection entirely Also introduces _make_build_only_ctx() helper to safely suppress __init__'s init_build_context call via a scoped patch.object context manager, avoiding the previous broken pattern of calling .stop() on a MagicMock (which is a no-op and left the real method never invoked). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-27T20:56:48Z


 import pytest
-from unittest.mock import Mock, patch
+from unittest.mock import Mock, MagicMock, patch


Mock is imported but never used in this test module. This will trigger flake8 F401 (the repo has a flake8 pre-commit hook in .pre-commit-config.yaml). Remove the unused import or use it.

Suggested change

from unittest.mock import Mock, MagicMock, patch

from unittest.mock import MagicMock, patch

Copilot · 2026-04-27T20:56:49Z

+            resolved_arch = self.context.ctx.get("docker_build_arg", {}).get("MAD_SYSTEM_GPU_ARCHITECTURE")
+            if resolved_arch:
+                self.rich_console.print(
+                    f"[green]✓ MAD_SYSTEM_GPU_ARCHITECTURE resolved: {resolved_arch}[/green]\n"
+                )


The build phase banner says “(Build-only mode - no GPU detection)”, but when detect_local_gpu_arch is enabled the build-only Context now performs GPU architecture detection (and these new lines print a resolved MAD arch). Consider updating the banner/message to reflect conditional detection so the output isn’t misleading.

Copilot · 2026-04-27T20:56:49Z

            build_only_mode: Whether running in build-only mode (no GPU detection).
            rocm_path: Optional ROCm installation path (overrides ROCM_PATH env; default /opt/rocm).
+            detect_local_gpu_arch: When True and in build_only_mode, attempt to auto-detect
+                MAD_SYSTEM_GPU_ARCHITECTURE from the local node and inject it into docker_build_arg.
+                Has no effect when build_only_mode=False (runtime mode detects it via init_gpu_context).


The build_only_mode arg is described as “no GPU detection”, but build-only mode can now still do GPU architecture detection when detect_local_gpu_arch=True. Consider updating this docstring wording so it matches the new behavior (e.g., “skips runtime GPU context; may optionally detect build arch”).

coketaste self-assigned this Apr 24, 2026

Copilot AI review requested due to automatic review settings April 24, 2026 04:00

coketaste requested review from Cemberk, Rohan138, gargrahul and leconcio as code owners April 24, 2026 04:00

Copilot started reviewing on behalf of coketaste April 24, 2026 04:01 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread src/madengine/orchestration/build_orchestrator.py

coketaste requested a review from Copilot April 24, 2026 19:45

Copilot started reviewing on behalf of coketaste April 24, 2026 19:46 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread src/madengine/orchestration/run_orchestrator.py

Comment thread src/madengine/core/context.py

coketaste and others added 2 commits April 27, 2026 12:55

Updated changelog

dafe88d

Copilot AI review requested due to automatic review settings April 27, 2026 20:50

coketaste merged commit 8467319 into develop Apr 27, 2026
1 check passed

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode#113

refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode#113
coketaste merged 3 commits intodevelopfrom
coketaste/mad-system-gpu-arch

coketaste commented Apr 24, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	from unittest.mock import Mock, MagicMock, patch
	from unittest.mock import MagicMock, patch

Conversation

coketaste commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

GPU arch auto-detection — design

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coketaste commented Apr 24, 2026 •

edited

Loading